[My Github link] https://github.com/sydneydlu98/DSI_Data_Challenge_5
In this Data Challenge, we will be clustering foods from the nndb_flat dataset provided on Canvas. To load/clean the data as well as perform some exploratory analysis:
## load all the packages
library(readr)
library(dplyr)
library(GGally)
library(tidyverse)
library(plotly)
## read in data
data <- read_csv("nndb_flat.csv")
## filter the data to only contain food groups of Vegetables and Vegetable Products, Beef Products, and Sweets
object <- c("Sweets", "Beef Products", "Vegetables and Vegetable Products")
clean_data <- data %>%
subset(FoodGroup == object)
var <- clean_data %>%
select(Energy_kcal:Zinc_mg)
GGally::ggcorr(var)
Steps for performing the PCA on the data:
data_scaled <- scale(var)
pca_data <- prcomp(data_scaled, center = FALSE, scale. = FALSE)
summary(pca_data)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3533 2.0017 1.5877 1.43916 1.08355 1.02965 0.92777
## Proportion of Variance 0.2408 0.1742 0.1096 0.09005 0.05105 0.04609 0.03742
## Cumulative Proportion 0.2408 0.4150 0.5246 0.61465 0.66570 0.71180 0.74922
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.89312 0.87936 0.81115 0.79179 0.69240 0.67572 0.63235
## Proportion of Variance 0.03468 0.03362 0.02861 0.02726 0.02084 0.01985 0.01739
## Cumulative Proportion 0.78390 0.81752 0.84613 0.87339 0.89423 0.91408 0.93147
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.61343 0.55427 0.4748 0.43995 0.40709 0.34613 0.3034
## Proportion of Variance 0.01636 0.01336 0.0098 0.00842 0.00721 0.00521 0.0040
## Cumulative Proportion 0.94783 0.96119 0.9710 0.97940 0.98661 0.99182 0.9958
## PC22 PC23
## Standard deviation 0.2957 0.09322
## Proportion of Variance 0.0038 0.00038
## Cumulative Proportion 0.9996 1.00000
var_explained_df <- data.frame(PC = paste0("PC", 1:23),
var_explained = summary(pca_data)$importance[2, ])
var_explained_df
var_explained_df %>%
ggplot(aes(x = PC, y = var_explained, group = 1)) +
geom_point() +
geom_line() +
labs(title = "Scree plot: PCA",
y = 'Cumulative variation explained',
x = 'PC')
pca_loadings <- as.data.frame(pca_data$rotation) %>%
dplyr::select(PC1, PC2, PC3) %>%
mutate(variable = rownames(pca_data$rotation)) %>%
pivot_longer(
cols = c('PC1', 'PC2', 'PC3'),
names_to = 'PC',
values_to = 'loadings'
)
ggplot(pca_loadings, aes(x = variable,
y = loadings)) +
geom_bar(stat = 'identity') +
facet_wrap( ~ PC)
pca_scores <- as.data.frame(pca_data$x)
pca_scores <- pca_scores %>%
mutate(FoodGroup = clean_data$FoodGroup)
head(pca_scores)
plot1 <- ggplot(pca_scores, aes(x = PC1,
y = PC2,
col = FoodGroup)) +
geom_point()
ggplotly(plot1)
plot2 <- ggplot(pca_scores, aes(x = PC1,
y = PC3,
col = FoodGroup)) +
geom_point()
ggplotly(plot2)
plot3 <- ggplot(pca_scores, aes(x = PC2,
y = PC3,
col = FoodGroup)) +
geom_point()
ggplotly(plot3)
There is a major outlier on the plots above – which food is the outlier? Remove the outlier from your data.
Perform PCA again on the dataset without the outlier (steps 1-4 in the Performing PCA section above) and look at the loadings of the first 3 PCs. Have these changed? Investigate and comment on what could have caused any changes.
Describe what you see in the plots of the scores and interpret this in conjunction with the loadings that you observed for the PCs.